Cython and Native Extensions

The Same Algorithm, Three Versions, 210x Apart

Before learning any Cython syntax, look at this benchmark. The algorithm is identical across all three implementations - a rolling sum of squares over an array:

# Version 1: Pure Python
def rolling_sum_squares_py(data: list, window: int) -> list:
    result = []
    n = len(data)
    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        result.append(total)
    return result

import time
data = list(range(10_000))
window = 100

start = time.perf_counter()
for _ in range(1000):
    rolling_sum_squares_py(data, window)
py_time = time.perf_counter() - start
print(f"Pure Python: {py_time:.3f}s")

Pure Python: 4.21s

Now the Cython version with just cdef type declarations - same algorithm, same .pyx file structure:

# rolling.pyx - Version 2: Cython with cdef types
def rolling_sum_squares_typed(list data, int window):
    cdef int n = len(data)
    cdef int i, j
    cdef double total
    cdef list result = []

    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        result.append(total)
    return result

Cython + cdef types: 0.18s  (23x speedup)

And finally with typed memoryviews - the critical Cython feature that eliminates Python object overhead on array access:

# rolling.pyx - Version 3: Cython with typed memoryviews
import numpy as np
cimport numpy as cnp

def rolling_sum_squares_mv(
    cnp.ndarray[cnp.float64_t, ndim=1] data,
    int window
):
    cdef double[::1] data_mv = data      # typed memoryview
    cdef int n = data_mv.shape[0]
    cdef int i, j
    cdef double total
    cdef cnp.ndarray[cnp.float64_t, ndim=1] result = np.empty(n - window + 1)

    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data_mv[i + j] * data_mv[i + j]
        result[i] = total
    return result

Cython + typed memoryviews: 0.020s  (210x speedup)

Version	Time	Speedup	What Changed
Pure Python	4.21 s	1x	Baseline
Cython + `cdef` types	0.18 s	23x	Static C types for loop variables
Cython + typed memoryviews	0.020 s	210x	Direct C-level array access, no PyObject boxing

The 210x version uses the same nested loop; there is no algorithmic change. Cython removed the Python object overhead on every array access and gave the C compiler enough type information to generate efficient machine code.
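To make "Python object overhead" concrete, here is a small standard-library sketch comparing how a list and a packed `array.array` hold the same doubles. The byte counts it prints are what 64-bit CPython typically reports and are an assumption about your interpreter; the variable names are illustrative only.

```python
import struct
import sys
from array import array

values = [float(i) for i in range(1_000)]

# A list holds pointers; each element is a separately allocated float object.
pointer_size = struct.calcsize("P")            # bytes per list slot
boxed_float_size = sys.getsizeof(values[0])    # bytes per float object
per_element_list = pointer_size + boxed_float_size

# array("d") stores raw C doubles back to back, which is the layout
# a typed memoryview reads directly.
packed = array("d", values)
per_element_packed = packed.itemsize

print(f"list:  ~{per_element_list} bytes per element, one pointer hop per access")
print(f"array: {per_element_packed} bytes per element, direct load")
```

The extra pointer hop and the unboxing of each float object are the per-access costs that the memoryview version avoids.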

What You Will Learn

Understand what Cython actually compiles to and how to read the annotation output
Set up a Cython build using setup.py and pyproject.toml
Declare static types with cdef, cpdef, and typed memoryviews
Release the GIL and use parallel loops with prange
Call C library functions from Cython
Use ctypes and cffi as alternatives to full Cython compilation
Know when NOT to use Cython

Prerequisites

Requirement	Level Needed
Python functions and modules	Comfortable
NumPy array basics	Familiar
C types (int, double, pointer)	Basic awareness
gcc or clang on the system	Required

Section 1: What Cython Actually Does

Cython is a superset of Python. Valid Python is valid Cython. The Cython compiler (the cython command) translates .pyx files into C code, which is then compiled by your system C compiler into a shared library (.so on Linux/macOS, .pyd on Windows) that Python can import.

my_module.pyx
     │
     │  cython my_module.pyx
     ▼
my_module.c          ← ~5000 lines of generated C
     │
     │  gcc -shared -fPIC ... my_module.c -o my_module.so
     ▼
my_module.so         ← importable from Python
     │
     │  import my_module
     ▼
Python call: my_module.rolling_sum_squares(data, 100)

The generated C code is real C - it handles Python reference counting, type checking at the boundaries, and calling conventions. What you write in .pyx determines how much of that overhead is present in the hot inner loop.

What the Generated C Looks Like

For a pure Python function in .pyx (no type declarations):

/* Generated C for: def add(x, y): return x + y */

static PyObject *__pyx_pw_6mymod_1add(PyObject *__pyx_self, PyObject *__pyx_args) {
    PyObject *__pyx_v_x = NULL;
    PyObject *__pyx_v_y = NULL;
    PyObject *__pyx_r = NULL;

    /* ... argument parsing ... */
    __pyx_t_1 = PyNumber_Add(__pyx_v_x, __pyx_v_y);  // Python-level addition
    /* ... reference counting ... */
    return __pyx_t_1;
}

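That is essentially the same dynamic dispatch the interpreter performs for pure Python: the arguments stay boxed as PyObject* and the addition goes through PyNumber_Add. Now with types: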

/* Generated C for: def add(double x, double y): return x + y */

static PyObject *__pyx_pw_6mymod_1add(PyObject *__pyx_self, PyObject *__pyx_args) {
    double __pyx_v_x;
    double __pyx_v_y;
    /* ... parse Python args into C doubles once ... */

    return PyFloat_FromDouble(__pyx_v_x + __pyx_v_y);  // C addition
}

The addition itself is now a single native floating-point instruction (addsd on x86-64). The Python object overhead exists only at the function boundary (argument parsing and return value creation), not inside loops.

Section 2: Setting Up Cython

Installation

pip install cython numpy
# Ensure C compiler is present:
# macOS: xcode-select --install
# Linux: apt install build-essential
# Windows: Visual Studio Build Tools

`setup.py` Approach (Classic)

# setup.py
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        name="rolling",                    # import name
        sources=["rolling.pyx"],           # source file
        include_dirs=[np.get_include()],   # NumPy headers
        extra_compile_args=["-O3", "-march=native"],  # optimise aggressively; -march=native binaries are not portable to other CPUs
    )
]

setup(
    name="rolling",
    ext_modules=cythonize(
        extensions,
        annotate=True,           # generate rolling.html annotation
        compiler_directives={
            "language_level": "3",
            "boundscheck": False,   # skip array bounds checking (DANGER: only after testing)
            "wraparound": False,    # skip negative index support
            "cdivision": True,     # C division semantics (no Python ZeroDivisionError)
        },
    ),
)

Build:

python setup.py build_ext --inplace
# Creates: rolling.cpython-312-x86_64-linux-gnu.so (or similar)
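The exact filename depends on the interpreter and platform. As a quick check, the interpreter can report the suffix it expects for extension modules. This sketch assumes the module is named `rolling`, as in the build above.

```python
import importlib.machinery
import sysconfig

# Suffix the running interpreter uses for compiled extension modules,
# e.g. ".cpython-312-x86_64-linux-gnu.so" on Linux.
ext_suffix = sysconfig.get_config_var("EXT_SUFFIX")
expected_filename = "rolling" + ext_suffix

print("build_ext should produce:", expected_filename)
# All suffixes the import system will accept for an extension module:
print("importable suffixes:", importlib.machinery.EXTENSION_SUFFIXES)
```

If the file produced by the build carries a different suffix, it was probably built with a different Python than the one trying to import it.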

`pyproject.toml` Approach (Modern)

# pyproject.toml
[build-system]
requires = ["setuptools", "cython", "numpy"]
build-backend = "setuptools.build_meta"

# Cython options such as annotate=True are passed to cythonize() in setup.py;
# setuptools and Cython do not read a [tool.cython] table.

# setup.py (still needed alongside pyproject.toml for Cython)
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize([
        Extension("rolling", ["rolling.pyx"],
                  include_dirs=[np.get_include()])
    ], compiler_directives={"language_level": "3"})
)

Inline `%%cython` in Jupyter

For experimentation without a build system:

# Cell 1:
%load_ext Cython

# Cell 2 (the %%cython magic must be the first line of its cell):
%%cython --annotate
# cython: boundscheck=False, wraparound=False

def rolling_sum_cython(double[::1] data, int window):
    cdef int n = data.shape[0]
    cdef int i, j
    cdef double total
    cdef double[::1] result = data[:n - window + 1].copy()

    for i in range(n - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] * data[i + j]
        result[i] = total
    return result

The --annotate flag produces the HTML annotation inline in the notebook.

Section 3: Type Declarations

Type declarations are the core of Cython. Without them, Cython is only slightly faster than Python. With them, the hot loop compiles to C that is comparable to a straightforward handwritten C implementation.

Variable Types: `cdef`

# rolling.pyx

def compute_stats(data):
    """No type declarations - almost identical speed to Python."""
    n = len(data)
    total = 0.0
    for i in range(n):
        total += data[i]
    mean = total / n
    return mean

def compute_stats_typed(data):
    """Typed locals - the loop counter and accumulator are C variables."""
    cdef int n = len(data)
    cdef int i
    cdef double total = 0.0
    cdef double mean

    for i in range(n):
        total += data[i]  # still Python object access if data is a list
    mean = total / n
    return mean

Function Types: `cdef`, `cpdef`, `def`

Declaration	Callable from Python?	Callable from Cython?	Overhead
`def`	Yes	Yes (slow)	Full Python ABI
`cpdef`	Yes	Yes (fast)	Thin wrapper
`cdef`	No	Yes (fast)	None

# inner.pyx

cdef double _inner_compute(double x, double y) nogil:
    """Pure C function - not callable from Python."""
    return x * x + y * y

cpdef double compute(double x, double y):
    """Callable from both Python and Cython efficiently."""
    return _inner_compute(x, y)

def compute_batch(list xs, list ys):
    """Standard Python-callable function."""
    cdef int n = len(xs)
    cdef int i
    cdef list result = [0.0] * n

    for i in range(n):
        result[i] = _inner_compute(xs[i], ys[i])
    return result

The Cython Annotation HTML - Reading Yellow Lines

Run cython -a module.pyx to generate module.html. Open it in a browser.

cython -a rolling.pyx
open rolling.html   # macOS; use xdg-open on Linux or open the file in a browser

Each line of your .pyx code is coloured:

White: pure C - no Python interaction
Yellow: involves Python API calls - the brighter the yellow, the more Python overhead
Dark yellow / orange: heavy Python interaction - this is where you need to add types

Example annotation interpretation:

# Yellow (calls PyNumber_Multiply, PyObject boxing)
total += data[i] ** 2

# White (native C floating point multiply)
cdef double val = data_mv[i]
total += val * val

The annotation HTML is the most important Cython debugging tool. After adding type declarations, check that your inner loop lines have turned white.

Section 4: Typed Memoryviews - The Key to Array Performance

Typed memoryviews are the feature that makes Cython's array processing competitive with hand-written C. They provide direct C-level access to the memory of NumPy arrays, array.array, bytes, and any object exposing the buffer protocol.

Declaration Syntax

# 1D C-contiguous array of doubles
cdef double[::1] arr

# 1D strided array (any stride; not necessarily contiguous)
cdef double[:] arr_strided

# 2D C-contiguous (row-major) array
cdef double[:, ::1] matrix

# 2D Fortran-contiguous (column-major)
cdef double[::1, :] matrix_f

The ::1 notation means "contiguous in this dimension" - equivalent to asserting that elements are laid out sequentially in memory without gaps. This allows the compiler to generate optimal load/store instructions.
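Python's built-in `memoryview` reports the same layout properties that these declarations constrain, so the difference between `::1` and `:` can be inspected without compiling anything. This is a sketch using `array.array` as the buffer; the variable names are illustrative.

```python
from array import array

buf = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

contiguous = memoryview(buf)     # unit stride: the layout double[::1] requires
strided = contiguous[::2]        # every other element: only double[:] would accept it

print(contiguous.strides, contiguous.c_contiguous)   # (8,) True
print(strided.strides, strided.c_contiguous)         # (16,) False

# Reinterpret the same 48 bytes as a 2x3 row-major matrix, the layout double[:, ::1] requires.
matrix = contiguous.cast("B").cast("d", shape=[2, 3])
print(matrix.strides, matrix.c_contiguous, matrix.f_contiguous)   # (24, 8) True False
```

Strides are in bytes. A stride equal to the item size in the last dimension is what the `::1` declaration asserts.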

Matrix Multiplication Example

# matmul.pyx
# cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np
cimport numpy as cnp

def matmul_python(A, B):
    """Pure Python matrix multiply - O(n³) with full overhead."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def matmul_cython(
    double[:, ::1] A,   # C-contiguous 2D
    double[:, ::1] B,
):
    """Cython matrix multiply with typed memoryviews."""
    cdef int n = A.shape[0]
    cdef int i, j, k
    cdef double total
    cdef double[:, ::1] C = np.zeros((n, n), dtype=np.float64)

    for i in range(n):
        for j in range(n):
            total = 0.0
            for k in range(n):
                total += A[i, k] * B[k, j]
            C[i, j] = total

    return np.asarray(C)

Benchmark on 200×200 matrices:

matmul_python:  3.42s
matmul_cython:  0.021s   (163x speedup)
np.dot(A, B):   0.0003s  (BLAS - 11,400x vs Python)

Note: for matrix operations, NumPy's BLAS backend is still far faster than Cython loops because BLAS libraries use hand-tuned kernels with cache blocking, multithreading, and SIMD instructions (AVX2 or AVX-512 where the CPU supports them). Use Cython for operations that NumPy cannot express as a single vectorised call.

Rolling Window - A Realistic Use Case

NumPy has no dedicated rolling-statistics function. numpy.lib.stride_tricks.sliding_window_view exposes the windows, but custom per-window logic still needs an explicit loop. Cython with memoryviews fills this gap:

# rolling_stats.pyx
# cython: boundscheck=False, wraparound=False

import numpy as np
cimport numpy as cnp

def rolling_mean_std(
    double[::1] data,
    int window,
):
    """
    Compute rolling mean and standard deviation.
    Returns two arrays: means, stds.

    Welford's online algorithm - numerically stable.
    """
    cdef int n = data.shape[0]
    cdef int out_len = n - window + 1
    cdef double[::1] means = np.empty(out_len, dtype=np.float64)
    cdef double[::1] stds  = np.empty(out_len, dtype=np.float64)

    cdef int i, j
    cdef double m, s, x, delta, delta2, M2

    for i in range(out_len):
        # Welford's algorithm over the window
        m = 0.0
        M2 = 0.0
        for j in range(window):
            x = data[i + j]
            delta = x - m
            m += delta / (j + 1)
            delta2 = x - m
            M2 += delta * delta2

        means[i] = m
        stds[i] = (M2 / window) ** 0.5   # population std (ddof=0)

    return np.asarray(means), np.asarray(stds)

Comparison vs pandas rolling().mean() and .std():

rolling_mean_std (Cython):  0.043s for 1M points, window=50
pandas rolling:             0.089s for 1M points, window=50

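One plausible reason for the gap is generality rather than language: pandas' built-in rolling mean and std are themselves compiled code, but they also handle NaN values, min_periods, and index bookkeeping, whereas this function does only the arithmetic. pandas falls back to Python-level aggregation only for rolling().apply() with a Python callable. Note also that this function computes the population standard deviation (dividing by window), while pandas .std() defaults to the sample standard deviation (ddof=1), so the outputs match only if ddof=0 is passed to pandas.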
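A pure-Python reference for the same per-window Welford update is useful as a correctness oracle when testing the compiled function. This sketch mirrors the loop in `rolling_mean_std`; the name `rolling_mean_std_reference` and the sample data are illustrative. It checks the result against the standard library's `statistics` module.

```python
import math
import statistics

def rolling_mean_std_reference(data, window):
    """Per-window Welford update; returns (means, population stds)."""
    means, stds = [], []
    for i in range(len(data) - window + 1):
        m = 0.0     # running mean of the current window
        m2 = 0.0    # running sum of squared deviations
        for j in range(window):
            x = data[i + j]
            delta = x - m
            m += delta / (j + 1)
            m2 += delta * (x - m)
        means.append(m)
        stds.append(math.sqrt(m2 / window))   # divide by window: ddof=0
    return means, stds

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
means, stds = rolling_mean_std_reference(sample, window=4)

# Cross-check every window against statistics.fmean / statistics.pstdev.
for i, (m, s) in enumerate(zip(means, stds)):
    win = sample[i:i + 4]
    assert math.isclose(m, statistics.fmean(win))
    assert math.isclose(s, statistics.pstdev(win))
```

Dividing `m2` by `window - 1` instead would give the sample standard deviation, which is what pandas `.std()` returns by default.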

Section 5: GIL Release and Parallel Computation

The most powerful Cython capability for compute-intensive work is releasing the GIL and running loops in parallel across CPU threads.

`with nogil:` Block

Any Cython function that:

Uses only C types (no Python objects)
Calls only other nogil functions

...can release the GIL and allow other threads to run Python code concurrently.

# parallel.pyx
# cython: boundscheck=False, wraparound=False

from cython.parallel import prange
import numpy as np
cimport numpy as cnp
from libc.math cimport sqrt, exp

def apply_gaussian_kernel_serial(
    double[::1] data,
    double sigma,
):
    """Serial version - processes elements one by one."""
    cdef int n = data.shape[0]
    cdef double[::1] result = np.empty(n, dtype=np.float64)
    cdef int i
    cdef double coeff = -1.0 / (2.0 * sigma * sigma)

    for i in range(n):
        result[i] = exp(coeff * data[i] * data[i])

    return np.asarray(result)


def apply_gaussian_kernel_parallel(
    double[::1] data,
    double sigma,
    int n_threads=4,
):
    """
    Parallel version - releases GIL, uses prange for OpenMP threading.
    Each element is independent → embarrassingly parallel.
    """
    cdef int n = data.shape[0]
    cdef double[::1] result = np.empty(n, dtype=np.float64)
    cdef int i
    cdef double coeff = -1.0 / (2.0 * sigma * sigma)

    with nogil:
        for i in prange(n, num_threads=n_threads, schedule='static'):
            result[i] = exp(coeff * data[i] * data[i])

    return np.asarray(result)

Build with OpenMP:

# setup.py for parallel Cython
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "parallel",
        ["parallel.pyx"],
        include_dirs=[np.get_include()],
        extra_compile_args=["-O3", "-fopenmp"],   # OpenMP flag for GCC/Clang (MSVC uses /openmp)
        extra_link_args=["-fopenmp"],
    )
]

setup(ext_modules=cythonize(extensions,
      compiler_directives={"language_level": "3"}))

Benchmark on 10M elements:

Serial:    0.182s
Parallel (4 threads): 0.051s   (3.6x speedup on 4 cores)

Thread Safety Requirements

To use prange safely, each loop iteration must be independent:

No shared mutable state written by multiple threads simultaneously
No Python objects (GIL is released - Python is not thread-safe without it)
No calls into Python code inside the nogil block

Violations cause data races (silently wrong results) or segfaults.
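The "independent iterations" requirement can be sketched in plain Python: split the index range into contiguous chunks and let each worker write only to its own slice of the output. This is an illustration of the partitioning idea only. Under CPython's GIL it gives no speedup, and `static_chunks` is a hypothetical helper that only resembles what `schedule='static'` does.

```python
import math
from array import array
from concurrent.futures import ThreadPoolExecutor

def static_chunks(n, n_threads):
    """Yield contiguous (start, stop) ranges that cover range(n) exactly once."""
    base, extra = divmod(n, n_threads)
    start = 0
    for t in range(n_threads):
        stop = start + base + (1 if t < extra else 0)
        yield start, stop
        start = stop

def gaussian_chunk(data, result, coeff, start, stop):
    # Reads shared input, writes only result[start:stop]: no overlap between workers.
    for i in range(start, stop):
        result[i] = math.exp(coeff * data[i] * data[i])

def apply_gaussian_kernel_threads(data, sigma, n_threads=4):
    coeff = -1.0 / (2.0 * sigma * sigma)
    result = array("d", [0.0]) * len(data)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(gaussian_chunk, data, result, coeff, start, stop)
            for start, stop in static_chunks(len(data), n_threads)
        ]
        for future in futures:
            future.result()   # re-raise any worker exception
    return result

data = array("d", [i / 10.0 for i in range(10)])
parallel = apply_gaussian_kernel_threads(data, sigma=1.0)
serial = [math.exp(-0.5 * x * x) for x in data]
```

Because every output index belongs to exactly one chunk, the result is the same regardless of the order in which the workers run.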

Section 6: Calling C Libraries from Cython

Cython can call any C function directly without the overhead of ctypes or cffi. This is the right approach when you need to wrap a performance-critical C library.

`cdef extern from` - Declaring C Functions

# math_ext.pyx
import numpy as np

# Option A: declare the C functions we want to use by hand
cdef extern from "math.h":
    double sin(double x) nogil
    double cos(double x) nogil
    double sqrt(double x) nogil
    double fabs(double x) nogil

# Option B (use instead of Option A, not alongside it, to avoid redeclaring the names):
# Cython ships ready-made declarations for the C math library.
# from libc.math cimport sin, cos, sqrt, fabs, M_PI

def compute_polar_to_cartesian(
    double[::1] r,
    double[::1] theta,
):
    """Convert polar coordinates to Cartesian - vectorised in C."""
    cdef int n = r.shape[0]
    cdef double[::1] x = np.empty(n, dtype=np.float64)
    cdef double[::1] y = np.empty(n, dtype=np.float64)
    cdef int i

    with nogil:
        for i in range(n):
            x[i] = r[i] * cos(theta[i])
            y[i] = r[i] * sin(theta[i])

    return np.asarray(x), np.asarray(y)

Wrapping a Custom C Function

Suppose you have a high-performance C library:

// fast_filter.h
double fast_ema(const double *data, int n, double alpha);

# fast_filter.pyx

cdef extern from "fast_filter.h":
    double fast_ema(const double *data, int n, double alpha) nogil

def exponential_moving_average(
    double[::1] data,
    double alpha,
):
    """
    Call into a C library function directly.
    The data memoryview provides a direct pointer to the array's memory.
    """
    cdef int n = data.shape[0]
    # &data[0] gives the pointer to the first element
    return fast_ema(&data[0], n, alpha)

Section 7: `ctypes` - Calling C Without Compilation

ctypes is the standard library solution for calling into shared C libraries from Python. No Cython compilation step required. Slower than Cython at the boundary but fine for infrequent calls into fast C functions.

import ctypes
import ctypes.util
import numpy as np
from pathlib import Path

# Load a shared library
libm = ctypes.CDLL(ctypes.util.find_library("m"))   # libm - C math library (POSIX; find_library("m") returns None on Windows)

# Declare the function signature
libm.sin.argtypes = [ctypes.c_double]
libm.sin.restype  = ctypes.c_double

# Call it
result = libm.sin(3.14159 / 2.0)
print(f"sin(π/2) = {result:.6f}")   # 1.000000

# Loading your own library
lib = ctypes.CDLL(str(Path(__file__).parent / "libfast.so"))
lib.fast_ema.argtypes = [
    ctypes.POINTER(ctypes.c_double),  # const double *data
    ctypes.c_int,                      # int n
    ctypes.c_double,                   # double alpha
]
lib.fast_ema.restype = ctypes.c_double

def ema_ctypes(data: np.ndarray, alpha: float) -> float:
    """Call fast_ema via ctypes - no compilation step."""
    assert data.dtype == np.float64 and data.flags['C_CONTIGUOUS']
    ptr = data.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
    return lib.fast_ema(ptr, len(data), alpha)
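The pointer-passing step can be tried without `libfast.so` (which is the tutorial's hypothetical library) by using a standard-library `array.array` in place of the NumPy array. The sketch below shows that the ctypes pointer refers to the array's own memory rather than a copy, which is what lets a C function read or fill the buffer in place.

```python
import ctypes
from array import array

samples = array("d", [1.0, 2.0, 3.0, 4.0])

# Zero-copy: the ctypes array is a view over the memory owned by `samples`.
c_samples = (ctypes.c_double * len(samples)).from_buffer(samples)
ptr = ctypes.cast(c_samples, ctypes.POINTER(ctypes.c_double))

# Write through the pointer, as a C function receiving `double *` could.
ptr[0] = 10.0
print(samples[0])   # 10.0

# Same address as the array's underlying buffer.
same_memory = ctypes.addressof(c_samples) == samples.buffer_info()[0]
print(same_memory)  # True
```

The Python object must stay alive for as long as C code holds the pointer; ctypes does not extend its lifetime once the call returns.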

`ctypes` for Structures

import ctypes

class ImageHeader(ctypes.Structure):
    """Match a C struct layout."""
    _fields_ = [
        ("width",    ctypes.c_uint32),
        ("height",   ctypes.c_uint32),
        ("channels", ctypes.c_uint8),
        ("depth",    ctypes.c_uint8),
    ]

header = ImageHeader(width=1920, height=1080, channels=3, depth=8)
print(f"Image: {header.width}x{header.height}, {header.channels}ch")
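Matching a C struct also means matching its padding. The following sketch redefines the same `ImageHeader` and inspects its size and field offsets, then parses raw bytes into it. The byte string is built with `struct` purely for illustration; in practice it would come from a file or a C function.

```python
import ctypes
import struct

class ImageHeader(ctypes.Structure):
    _fields_ = [
        ("width",    ctypes.c_uint32),
        ("height",   ctypes.c_uint32),
        ("channels", ctypes.c_uint8),
        ("depth",    ctypes.c_uint8),
    ]

# 4 + 4 + 1 + 1 = 10 bytes of fields, padded to a multiple of the
# 4-byte alignment of c_uint32, just as a C compiler would do.
header_size = ctypes.sizeof(ImageHeader)
print(header_size)                                           # 12
print(ImageHeader.channels.offset, ImageHeader.depth.offset) # 8 9

# Native byte order: two uint32, two uint8, two padding bytes.
raw = struct.pack("IIBB2x", 1920, 1080, 3, 8)
header = ImageHeader.from_buffer_copy(raw)
print(header.width, header.height, header.channels, header.depth)
```

If `ctypes.sizeof` disagrees with `sizeof` on the C side, the field order, types, or packing do not match the C definition.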

Section 8: `cffi` - The Modern C Integration API

cffi (C Foreign Function Interface) is more ergonomic than ctypes for complex C APIs. It parses actual C declarations and is a common choice for wrapping large C libraries. It does not understand C++ directly, so C++ code must be exposed through an extern "C" interface first.

pip install cffi

ABI Mode (No Compilation)

from cffi import FFI

ffi = FFI()

# Declare the C functions exactly as in the header
ffi.cdef("""
    double fast_ema(const double *data, int n, double alpha);
    int process_batch(
        const double *input,
        double *output,
        int n,
        double threshold
    );
""")

# Load the shared library
lib = ffi.dlopen("./libfast.so")

import numpy as np

def ema_cffi(data: np.ndarray, alpha: float) -> float:
    """Call fast_ema via cffi - cleaner than ctypes for complex APIs."""
    assert data.dtype == np.float64 and data.flags['C_CONTIGUOUS']
    # cffi can cast a numpy array's buffer to a C pointer
    c_data = ffi.cast("double *", data.ctypes.data)
    return lib.fast_ema(c_data, len(data), alpha)


def process_batch_cffi(input_arr: np.ndarray, threshold: float) -> np.ndarray:
    """Demonstrates in/out array pattern."""
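    # The C function expects contiguous float64 data; convert if necessary
    input_arr = np.ascontiguousarray(input_arr, dtype=np.float64)
    n = len(input_arr)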
    output_arr = np.empty(n, dtype=np.float64)

    c_in  = ffi.cast("double *", input_arr.ctypes.data)
    c_out = ffi.cast("double *", output_arr.ctypes.data)

    result_code = lib.process_batch(c_in, c_out, n, threshold)
    if result_code != 0:
        raise RuntimeError(f"process_batch failed with code {result_code}")

    return output_arr

API Mode (With Compilation - Fastest)

from cffi import FFI

ffi = FFI()

ffi.cdef("""
    double fast_ema(const double *data, int n, double alpha);
""")

ffi.set_source(
    "_fast_lib",           # output module name
    """
    #include "fast_filter.h"
    """,
    sources=["fast_filter.c"],
    extra_compile_args=["-O3", "-march=native"],
)

if __name__ == "__main__":
    ffi.compile(verbose=True)

API mode compiles the C code and links it into a Python extension module. The resulting _fast_lib module is importable (from _fast_lib import ffi, lib) and calls the C function with much lower per-call overhead than ABI mode.

Section 9: When NOT to Use Cython

Cython adds build complexity. It requires a C compiler, complicates CI/CD pipelines, and makes the code less accessible to contributors who do not know Cython syntax. Before reaching for it:

Decision Matrix

Situation	Use Cython?	Better Alternative
Numerical loop over NumPy arrays	Maybe	Numba @njit first - simpler
Custom operation NumPy cannot express	Yes	Cython typed memoryviews
Wrapping an existing C/C++ library	Yes	Cython or cffi
String/text processing bottleneck	No	Python re, regex, or Rust
I/O bottleneck (disk, network, database)	No	asyncio or better algorithm
Algorithm is O(n²), should be O(n log n)	No	Fix the algorithm first
Library already in NumPy/scipy/pandas	No	Use the library
One-time data transformation (not in hot path)	No	Not worth the complexity
Bottleneck is < 5% of total runtime	No	Profile better targets
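The "fix the algorithm first" row applies to the rolling sum of squares from the opening benchmark. That loop is O(n × window), but a running total makes it O(n) in pure Python. This is a sketch; `rolling_sum_squares_running` is an illustrative name, not part of the tutorial's code.

```python
def rolling_sum_squares_naive(data, window):
    """O(n * window): recompute every window from scratch."""
    result = []
    for i in range(len(data) - window + 1):
        total = 0.0
        for j in range(window):
            total += data[i + j] ** 2
        result.append(total)
    return result

def rolling_sum_squares_running(data, window):
    """O(n): slide the window by adding the new square and dropping the old one."""
    if window > len(data):
        return []
    squares = [x * x for x in data]
    total = sum(squares[:window])
    result = [total]
    for i in range(window, len(squares)):
        total += squares[i] - squares[i - window]
        result.append(total)
    return result

data = list(range(50))
assert rolling_sum_squares_running(data, 5) == rolling_sum_squares_naive(data, 5)
```

With integer input the two versions agree exactly. With floating-point input the running total accumulates rounding error over long arrays, so compare with a tolerance.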

The Cython Complexity Budget

Each .pyx file adds to your project's complexity budget:

CI must compile Cython before running tests
Wheels must be built for each Python version and platform (or require compilation on install)
Stack traces from .pyx files are harder to read
Debugging requires understanding both Python and C error domains

Rule of thumb: Cython is worth the complexity budget when the speedup is 10x or greater and the bottleneck accounts for at least 10% of total runtime. For smaller gains, prefer Numba (no build step, at the cost of JIT warm-up and an LLVM dependency) or NumPy vectorisation.
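The interaction between "how much faster" and "how much of the runtime" is Amdahl's law. A short calculation shows why a large local speedup can still yield a small end-to-end gain. The function name and sample figures are illustrative.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: end-to-end speedup when `fraction` of the runtime
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# A 10x speedup on 10% of the runtime barely moves the total.
print(round(overall_speedup(0.10, 10), 3))   # 1.099

# The same 10x speedup on 80% of the runtime is a different story.
print(round(overall_speedup(0.80, 10), 3))   # 3.571
```

Profile first to measure the fraction, then decide whether the achievable end-to-end number justifies a compiled extension.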

Section 10: Compiler Directives Reference

These directives control Cython's safety vs. performance tradeoffs. Enable them in the file header or globally in setup.py:

# At the top of any .pyx file:
# cython: boundscheck=False, wraparound=False, cdivision=True, nonecheck=False

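Directive	Default	Performance Effect	Safety Cost
`boundscheck=False`	True	5–30% speedup	Out-of-bounds access = segfault or silent memory corruption
`wraparound=False`	True	2–10% speedup	Negative indices are not adjusted; they access memory out of bounds
`cdivision=True`	False	3–15% speedup	Division by zero = C undefined behaviour, not ZeroDivisionError; negative operands round as in C
`nonecheck=False`	False (already the default)	2–5% speedup relative to `nonecheck=True`	None access = segfault
`initializedcheck=False`	True	2–5% speedup	Uninitialised memoryview = segfault
`language_level=3`	2 in Cython 0.29; Python 3 semantics from Cython 3.0	None (selects Python 3 semantics)	Set explicitly so behaviour does not depend on the Cython version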

Safety protocol: develop with all safety checks enabled (defaults). Only disable them after tests pass, and only for functions that have been verified correct.
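The `cdivision` row changes results, not just speed, when operands are negative. The sketch below emulates C's truncating integer division in plain Python to show the difference; `c_div` and `c_mod` are illustrative helpers, not Cython APIs.

```python
def c_div(a, b):
    """Integer division truncating toward zero, as C does."""
    quotient = abs(a) // abs(b)
    return quotient if (a >= 0) == (b >= 0) else -quotient

def c_mod(a, b):
    """Remainder consistent with c_div: takes the sign of the dividend."""
    return a - b * c_div(a, b)

# Python semantics (cdivision=False): floor division, remainder has the divisor's sign.
print(-7 // 2, -7 % 2)               # -4 1

# C semantics (cdivision=True): truncation, remainder has the dividend's sign.
print(c_div(-7, 2), c_mod(-7, 2))    # -3 -1
```

Code that uses `%` on possibly negative C integers to compute array indices is the usual place where this difference turns into a bug.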

Interview Questions

Q1: What is the difference between cdef, cpdef, and def in Cython? When would you use each?

def creates a standard Python function. It is callable from Python with the full Python calling convention - arguments are Python objects, the return value is a Python object. Inside the function, Cython can use cdef variable types to speed up local computation, but the function entry/exit pays Python overhead.

cdef creates a C-level function that is NOT callable from Python. It accepts and returns C types directly, has zero Python calling overhead, and can be declared nogil. Use cdef for internal helper functions called from within .pyx code that you never need to call directly from Python.

cpdef creates both a C version and a Python wrapper. When called from Cython, the C version is used (fast path). When called from Python, the Python wrapper is used. Use cpdef for functions that need to be both performance-critical when called from Cython AND accessible from Python (e.g., module-level API functions that also call each other internally).

Q2: What is a typed memoryview and why does it enable such large speedups over Python list access?

A typed memoryview is a Cython construct that wraps any object implementing the Python buffer protocol (NumPy arrays, bytearray, array.array, etc.) and provides direct C-level pointer access to the underlying memory.

Without memoryviews, accessing data[i] in a Cython function that receives a Python list involves:

A call to PyList_GetItem(data, i)
A bounds check
Returning a PyObject*
Unboxing the PyObject to extract the C value

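With a typed memoryview double[::1] data, accessing data[i] compiles to a single indexed load, roughly ((double *) data.data)[i], because the contiguous layout fixes the stride at sizeof(double). A strided double[:] view adds one multiplication by data.strides[0], which is measured in bytes. Either way the cost is approximately that of one memory load instruction.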

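For a loop over 10 million elements, the difference is 10M Python API calls versus 10M C pointer dereferences. At tens of nanoseconds per Python-level access versus about a nanosecond per cached memory load (rough, hardware-dependent figures), the speedup is on the order of that ratio.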

Q3: What are the boundscheck and wraparound Cython directives? When is it safe to disable them?

boundscheck=True (the default) causes Cython to insert an if i < 0 or i >= n check before every array access. This prevents out-of-bounds access from silently corrupting memory - instead you get an IndexError. The cost is one comparison and conditional branch per access - typically 5–30% overhead.

wraparound=True (the default) supports Python's negative indexing convention: data[-1] accesses the last element. Cython implements this by checking for negative indices and adjusting them before the access. Disabling it makes negative indexing produce incorrect results without an error.

It is safe to disable both when:

The function has been thoroughly tested with the defaults enabled
All loop bounds are provably within range (e.g., for i in range(n) where n = data.shape[0])
No negative indices are used anywhere in the function

The typical production workflow: develop with defaults, run tests, then add # cython: boundscheck=False, wraparound=False to the file header and verify tests still pass. If any test fails, the code relies on negative indexing or on an IndexError being raised, and the indexing logic must be fixed before the checks can be removed.

Q4: How does prange work and what are the requirements for a loop to be safe to parallelise with it?

prange is Cython's parallel range, built on OpenMP. It distributes iterations of a loop across multiple threads, each with its own stack. The GIL must be released for the loop, either by wrapping it in a with nogil: block or by passing nogil=True to prange.

Requirements for safe parallelisation with prange:

No Python objects: the loop body cannot create, access, or modify Python objects. The GIL is released.
No cross-iteration data hazards: if one iteration reads or writes a location that another iteration writes, the result depends on thread scheduling. Each iteration should read shared inputs and write only to its own output slot (e.g., result[i] = f(data[i]) and result[i] = data[i-1] + data[i+1] are both safe because data is only read; the in-place update data[i] = data[i-1] + data[i+1] is not).
Reductions must use in-place operators: there is no explicit reduction clause. When the loop body updates a scalar with an in-place operator, as in for i in prange(n, nogil=True): total += data[i], Cython infers that total is a reduction variable, gives each thread a private copy, and combines the copies at the end.
No dynamic allocation inside the loop unless thread-safe: malloc/free in the loop body is generally safe; Python allocations are not (GIL is released).
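The reduction requirement can be sketched in plain Python: each worker accumulates into its own private total, and the partial totals are combined once at the end. This only illustrates the idea behind a parallel reduction. It is not how Cython or OpenMP implement it, and it gains no speed under CPython's GIL. The names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(data, start, stop):
    total = 0.0   # private to this call, like a thread-private copy
    for i in range(start, stop):
        total += data[i]
    return total

def reduced_sum(data, n_threads=4):
    n = len(data)
    step = max(1, -(-n // n_threads))   # ceiling division, at least 1
    bounds = [(start, min(start + step, n)) for start in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = pool.map(lambda b: partial_sum(data, *b), bounds)
        return sum(partials)            # the combine step of the reduction

values = [float(i) for i in range(1000)]
total = reduced_sum(values)
```

Because floating-point addition is not associative, a parallel reduction can differ from the serial sum in the last few bits. The values here are small integers, so the result is exact.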

Q5: When should you prefer cffi over ctypes for calling C code from Python? When should you prefer Cython over both?

ctypes is the right choice for simple, infrequent calls into a C library when you cannot modify the build system. It ships with Python (no extra dependencies) and works without compiling anything. The API becomes unwieldy for complex C structures and function pointers.

cffi is preferred when: the C API is complex (many structs, callbacks, or pointer-heavy interfaces), you want to paste actual C header declarations instead of manually mirroring the types in Python, or you need API mode (compile-time linking) for maximum call speed. cffi is also the standard approach for PyPy compatibility.

Prefer Cython over both when the bottleneck is in the loop body, not just at the function call boundary. ctypes and cffi let Python call an existing compiled function, but they do nothing for the Python code that surrounds the call: a Python loop that calls a small C function once per element still pays interpreter and boundary overhead on every iteration. Cython compiles the entire function, including the loop interior, to C. For tight numerical loops, a Cython function over typed memoryviews will therefore significantly outperform a Python loop that calls equivalent C functions through ctypes or cffi. If the C function already processes the whole array in one call, the boundary cost is paid once and any of the three approaches is adequate.
